Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Atomic migrations #7909

Merged
merged 5 commits into from
Jan 21, 2020
Merged

Atomic migrations #7909

merged 5 commits into from
Jan 21, 2020

Conversation

chrismwendt
Copy link
Contributor

@chrismwendt chrismwendt commented Jan 21, 2020

tl;dr this makes migrations atomic, which ensures that successful migrations always leave the DB in a clean state.

Prior to this change, it was possible for the frontend to disconnect from Postgres halfway through a migration (e.g. when the migration takes >5 minutes and k8s kills the frontend because it's not responding), leaving the DB marked as being in a dirty state despite the migration eventually succeeding.

After this change, each migration will set dirty=false within the transaction to guarantee that when a migration succeeds the DB will be in a clean state.

More context:

A few alternatives:

  • Accept that migrations can fail in this way and document how to fix the DB when this happens
  • Use my patched version of golang-migrate until a satisfactory solution has been upstreamed
  • Mitigation: increase the timeout for DB migrations
  • Don't merge slow migrations into master, somehow check perf before merging
  • Workaround 1: SET version=..., dirty=false at the end of each migration in our migrations/ directory
  • Workaround 2 (implemented in this PR): splice that statement into the migration at runtime by intercepting the migration loader
  • Workaround 3: inline the ~40 lines of relevant code from golang-migrate and patch that in our source

3 teammates supported workaround 3:

I vote for workaround 3, since we’re postgres only and the value-add from that lib is not that high. We can also do more postgres specific things if we go workaround 3 route and do stuff more tailored towards our exact needs.

Upon further consideration, it became more clear to me that workaround 2 would be better:

@eric and I chatted about this in depth and realized that workaround 3 is not very desirable because it would not play well with the migrate command line tool.

Workaround 2 is more compatible and doesn't change the control flow as much.

Given that, we plan to proceed with workaround 2.

This PR is based on https://github.com/sourcegraph/sourcegraph/pull/7905, which is expected to be merged before this PR.

@chrismwendt chrismwendt requested review from efritz and a team January 21, 2020 01:06
@chrismwendt chrismwendt force-pushed the atomic-migrations branch 2 times, most recently from 476af06 to c2f18e8 Compare January 21, 2020 03:23
@unknwon unknwon changed the base branch from master to clean-up-migrate-db January 21, 2020 04:28
@unknwon
Copy link
Contributor

unknwon commented Jan 21, 2020

FYI, I changed the merge base branch to the clean-up-migrate-db for easier review (real change set for this PR):

image

Don't forget to change it back to master once #7905 is merged :)

return nil, errors.New("expected migration filename to be of the form <version>_<name>")
}
version := versionParts[0]
newContents := strings.Replace(string(oldContents), "COMMIT;", fmt.Sprintf("UPDATE schema_migrations SET version='%s', dirty=false;\nCOMMIT;", version), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect golang-migrate to already update the version in the database. If that premise is valid, why do we need to re-update it? Would it be enough to just set dirty=false (and avoid all the version parsing code)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be enough to just set dirty=false (and avoid all the version parsing code)?

Oh right, yes, golang-migrate sets the version so we can drop the version parsing code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No you can't!

If you set dirty=false but not the version, the frontend will not be around to update the version (necessarily) so you'll have applied the migration but the version will be the old one. This will cause the next frontend attempt to fail (forever).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efritz golang-migrate sets the version and dirty=true prior to the migration.

https://github.com/golang-migrate/migrate/blob/v4.8.0/migrate.go#L739-L754

			// set version with dirty state
			if err := m.databaseDrv.SetVersion(migr.TargetVersion, true); err != nil {
				return err
			}

			if migr.Body != nil {
				m.logVerbosePrintf("Read and execute %v\n", migr.LogString())
				if err := m.databaseDrv.Run(migr.BufferedBody); err != nil {
					return err
				}
			}

			// set clean state
			if err := m.databaseDrv.SetVersion(migr.TargetVersion, false); err != nil {
				return err
			}

I believe that means the frontend doesn't need to be around to update the version.

I won't merge this in until we reach consensus here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I thought it only set dirty=true first, then dirty=false, version=target at the end.

I'm convinced you can remove it now.

@codecov
Copy link

codecov bot commented Jan 21, 2020

Codecov Report

Merging #7909 into clean-up-migrate-db will increase coverage by <.01%.
The diff coverage is 62.5%.

@@                   Coverage Diff                   @@
##           clean-up-migrate-db    #7909      +/-   ##
=======================================================
+ Coverage                40.57%   40.57%   +<.01%     
=======================================================
  Files                     1279     1279              
  Lines                    66889    66896       +7     
  Branches                  6320     6320              
=======================================================
+ Hits                     27139    27144       +5     
- Misses                   37277    37278       +1     
- Partials                  2473     2474       +1
Impacted Files Coverage Δ
internal/db/dbutil/dbutil.go 17.81% <62.5%> (+2.24%) ⬆️

@chrismwendt chrismwendt merged commit c809444 into clean-up-migrate-db Jan 21, 2020
@chrismwendt chrismwendt deleted the atomic-migrations branch January 21, 2020 21:22
@chrismwendt
Copy link
Contributor Author

I merged this into the wrong branch 🤦 It never made it into master.

CleanShot 2021-09-17 at 16 20 46@2x

I'm attempting this again in (TODO make PR)

@efritz
Copy link
Contributor

efritz commented Sep 19, 2021

@chrismwendt I think there's a problem in this solution as well. If the migration continues to run after the timeout (and the client disconnects), then it may be the case that a new frontend will try to re-attempt the migration and fail. (This is incorrect since the dirty flag is still visible while that query is running).

This could cause a lot of noise during long migrations that still result in a CrashLoopBackoff that might eventually resolve without user intervention.

Like, it's a better state than we're in, but it's not nearly as good as what RFC 469's proposal guarantees (imo).

@chrismwendt
Copy link
Contributor Author

@efritz It looks like golang-migrate acquires an advisory lock before checking the dirty state, so frontends won't see the dirty state:

CleanShot 2021-09-19 at 12 22 53@2x

If it did it the other way around, then on upgrade all but then frontend that first began the migration would throw errors about a dirty DB (which doesn't happen).

Although the frontend isn't around to explicitly release the lock, Postgres automatically releases it once the session ends (which I'm guessing is immediately after the SQL commands finish):

CleanShot 2021-09-19 at 12 42 41@2x

I tried interrupting migrations in a few different ways to see the behavior:

migration-interruptions.mp4

It seems to me that SET dirty = 'f' within the migration produces desirable behavior, but it's been a long time since I first looked at this. Am I misunderstanding anything here?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants